Identification of microbial markers associated with lung cancer based on multi-cohort 16S rRNA analyses: A systematic review and meta-analysis

Abstract

Background: The relationship between commensal microbiota and lung cancer (LC) has been studied extensively. However, developing replicable microbiological markers for early LC diagnosis across multiple populations has remained challenging. Current studies are limited to single regions, single LC subtypes, and small sample sizes. Therefore, we aimed to perform the first large-scale meta-analysis to identify microbial biomarkers for LC screening by integrating gut and respiratory samples from multiple studies and building a machine-learning classifier.

Methods: In total, 712 gut and 393 respiratory samples were assessed via 16S rRNA amplicon sequencing. After identifying differential biomarker taxa, we established random forest models to distinguish between LC populations and normal controls. We validated the robustness and specificity of the models using external cohorts. We also used the KEGG database for the predictive analysis of microbiota-related functions.

Results: The α and β diversity indices indicated that the gut microbiota (GM) and lung microbiota (LM) of LC patients differed significantly from those of the healthy population. Linear discriminant analysis effect size (LEfSe) identified the top-ranked biomarkers shared by the two microbial niches: Enterococcus, Lactobacillus, and Escherichia. The area under the curve values of the diagnostic models for the gut and lung were 0.81 and 0.90, respectively. KEGG enrichment analysis also revealed significant differences in microbiota-associated functions between cancer-affected and healthy individuals, which were primarily associated with metabolic disturbances.

Conclusions: GM and LM profiles were significantly altered in LC patients compared to healthy individuals. We identified biomarker taxa at the two sites and constructed accurate diagnostic models.
This study demonstrates the effectiveness of LC‐specific microbiological markers in multiple populations and contributes to the early diagnosis and screening of LC.

Lung cancer (LC) has the fastest-growing incidence and mortality rates among cancers and represents a severe threat to the health and life of individuals [1,2]. The early diagnosis rate of LC is about 15%, and most patients present with locally advanced or metastatic disease at diagnosis, which results in a poor prognosis [3]. Therefore, early screening is pivotal for reducing LC-related deaths and improving patient survival rates [4]. However, the acceptance rate of screening is low due to the invasiveness and high cost of current clinical examination methods. Therefore, we urgently need to develop a convenient and cost-effective strategy for LC diagnosis [5].
With the advent of precision medicine, artificial intelligence (AI) has received increasing attention [6]. Machine learning (ML), an important AI technique, shows compelling performance in clinical tasks such as the diagnosis [7], prognosis [8,9], and treatment of diseases. In the US, the Food and Drug Administration has approved AI-based medical algorithms for diagnosing and evaluating lung nodules and LC [10]. However, these algorithms rely on medical data from pathology, radiology, and endoscopy reports. Hence, we examined whether ML could be used to develop cancer diagnostic models using non-invasively obtained samples. The development of early screening methods that are well accepted by the public will accelerate the translation of results obtained using ML models into real-world applications.
As early as 2015, it was reported that microbial dysbiosis in specific organs could be involved in carcinogenesis. Microbial dysbiosis could affect the host immune system, change the balance between the proliferation and death of host cells, and produce carcinogenic metabolites [11]. The term "gut-lung axis" has been coined in recent years to explain the emerging pathogenic links between LC and microbiota [12]. Subsequently, research focus has shifted to the contributions and implications of microbial species in both microbial niches for LC progression and exacerbation. The lung microbiota (LM) and gut microbiota (GM) of the LC population have been studied in many countries and regions [13,14]. Zheng et al. conducted a meta-analysis of eight LM sequencing studies involving 530 participants. The results indicated that the Actinobacteria and Firmicutes phyla were more abundant in the cancer group than in the normal control group. At the genus level, the abundance levels of specific bacteria (e.g., Prevotella and Streptococcus) differed significantly between the two groups [15]. Consequently, relevant studies have described the construction of diagnostic models using ML algorithms based on sequencing data generated from LM DNA samples obtained from LC patients. According to the results reported by Jin et al., the area under the curve (AUC) value of a diagnostic panel built with random forest (RF) regression analysis reached 0.882 [16]. Thus, LM is a promising diagnostic biomarker. Gut dysbiosis has also been extensively reported in LC patients.
[17,18] Modeling studies based on patient GM DNA datasets also represent a popular research topic due to the easy availability of fecal samples and the significant differences in DNA between groups of individuals with and without cancer. [...] [21-23]
Although there has been considerable progress in the study of microbial biomarkers associated with LC, these studies have been limited to a single region or subtype. No direct evidence indicates which microbiota would enable the accurate and early detection of LC. There is also no consensus on which genus can be used most effectively for LC screening. To address these two questions, we conducted a meta-analysis involving LC populations from various regions and with various cancer types and stages. Studies were conducted with bronchoalveolar lavage fluid (BALF) and bronchial brush samples (as a proxy for LM) and fecal samples (as a proxy for GM) to compare the accuracy of GM and LM markers. We also attempted to elucidate the variations in microbial function to provide theoretical support for further exploring the potential mechanisms by which microorganisms play a role in LC carcinogenesis. Our study demonstrates the validity of LC-specific microbial markers in multiple populations, and our findings could help guide the use of microbes as biomarkers for assessing LC progression and developing targeted therapies.

| Search strategy
Two reviewers (WH and YW) from the research team were responsible for conducting systematic literature searches in the PubMed and Embase databases and performing biological data mining from the Sequence Read Archive (SRA) and European Bioinformatics Institute databases. There were no restrictions on the language or year of publication during retrieval, and information was most recently updated on June 1, 2022. The detailed search strategy is described in Supplementary Section 1.

| Inclusion and exclusion criteria
The criteria for selecting articles were as follows: (1) Analysis of GM was performed using fecal matter as the study sample, and it was collected after LC diagnosis and before the patient received treatment. (2) Representative samples of LM were mainly BALF and bronchial brush samples collected from patients subjected to bronchoscopy to evaluate lung disease. (3) The selection of LC patients was not limited by subtype, stage, or smoking history. (4) The benign group included patients with clinically confirmed benign pulmonary nodules according to disease guidelines. (5) Normal control individuals had no respiratory symptoms, chest X-ray abnormalities, or history of lung disease.
Studies with the following characteristics were excluded: (1) studies unrelated to the topic; (2) case reports, comments, or review articles; (3) studies in which raw 16S rRNA gene sequencing data were not publicly available or could not be grouped clearly; and (4) studies involving patients who had received treatment. Controls with a history of cancer or recent (less than 1 month) use of antibiotics were also excluded.

| Data extraction and quality assessment
Two reviewers (WH and NW) independently extracted the following information from each study: authors, country, publication year, sample type, sample size, grouping, microbiological assessment method, and NCBI BioProject ID. Two reviewers (WH and MH) independently assessed quality using the ROBIS tool and the Joanna Briggs Institute Critical Appraisal Checklist (Tables S1 and S2; Figure S1). Discrepancies, if any, were discussed with the third reviewer (JX) to reach a consensus.

| Data preprocessing
Sequencing and sample data were downloaded from the NCBI SRA. After the raw FASTQ files were downloaded, they were demultiplexed, and VSEARCH (v2.18.0) was used to merge the raw reads, trim primers and barcodes, and filter out low-quality data [24]. The generated clean data were merged again, and the feature table and representative sequences were obtained by dereplication and denoising. Finally, operational taxonomic units (OTUs) with relative abundance levels greater than 1/10000 were selected as the final representative sequences. Based on these sequences, we used QIIME1, a plugin-based platform, for microbiome analysis and clustering. We classified the relative abundance of the species into five levels: phylum, class, order, family, and genus. Then, the Greengenes (v13.8) database was used to conduct taxonomic annotation [25]. The generated rarefied data set was used for downstream analyses.
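
The abundance filter described above can be illustrated with a minimal Python sketch. It assumes that the 1/10000 threshold applies to each OTU's relative abundance pooled over all samples, which the text does not specify. The function name `filter_low_abundance` and the toy count table are hypothetical and are not part of the original VSEARCH/QIIME workflow.

```python
def filter_low_abundance(otu_counts, min_rel_abundance=1e-4):
    """Keep OTUs whose pooled relative abundance exceeds the threshold.

    otu_counts maps an OTU identifier to a list of read counts, one per sample.
    Assumption: abundance is pooled over all samples before thresholding.
    """
    grand_total = sum(sum(row) for row in otu_counts.values())
    if grand_total == 0:
        return {}
    return {
        otu: row
        for otu, row in otu_counts.items()
        if sum(row) / grand_total > min_rel_abundance
    }


# Toy table with three samples (hypothetical numbers).
table = {
    "OTU_1": [5000, 7000, 6000],
    "OTU_2": [4000, 2999, 3999],
    "OTU_3": [1, 0, 1],  # 2 reads out of 29000, below 1/10000
}
kept = filter_low_abundance(table)
print(sorted(kept))
```

A per-sample threshold would be an equally plausible reading; in that case the comparison would be applied column by column rather than to the pooled total.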

| Data analyses
R version 3.6.1 was used for all downstream bioinformatics analysis and data visualization.

| Confounder analysis
We used an ANOVA-type analysis to quantify the effects of confounding factors and disease status on individual microbial species. The total variance for a given OTU was compared to the variance attributable to confounders (age, blood glucose level, alcohol consumption history, smoking history, sex, body mass index [BMI], and study) and to the variance attributable to disease status (cancer-affected vs. normal control), analogous to a linear model. Considering the non-Gaussian distribution of microbiome abundance data, variance calculations were performed based on ranks [26]. Confounders with continuous values were converted to categorical data either as quartiles or, for glucose levels, according to conventional cutoffs that classify individuals as having hypoglycemia (<3.9 mmol/L), normal glucose levels (3.9-6.0 mmol/L), or hyperglycemia (>6.0 mmol/L).
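
As a rough illustration of the rank-based variance partitioning described above, the following Python sketch computes, for a single OTU, the fraction of rank variance explained by one categorical factor (between-group sum of squares divided by total sum of squares). This is a simplified one-factor sketch under our own assumptions, not the authors' implementation; the names `average_ranks` and `variance_explained` and the example data are hypothetical.

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def variance_explained(abundance, labels):
    """Fraction of rank variance of one OTU attributable to a categorical factor."""
    ranks = average_ranks(abundance)
    grand_mean = sum(ranks) / len(ranks)
    ss_total = sum((r - grand_mean) ** 2 for r in ranks)
    if ss_total == 0:
        return 0.0
    groups = defaultdict(list)
    for rank, label in zip(ranks, labels):
        groups[label].append(rank)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Hypothetical OTU in which "study" explains more rank variance than disease status.
abundance = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
study = ["A", "A", "A", "B", "B", "B"]
status = ["LC", "NC", "LC", "NC", "LC", "NC"]
by_study = variance_explained(abundance, study)
by_status = variance_explained(abundance, status)
print(round(by_study, 3), round(by_status, 3))
```

Plotting the two fractions against each other for every OTU gives the kind of confounder-versus-disease comparison described in this section.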

| Diversity analyses
The R package "vegan" was used for diversity analyses. During α diversity analysis, we mainly examined the Chao1, Shannon, and Simpson indexes. We used the Wilcoxon rank sum test to examine the significance of differences between two groups. We used an analysis of variance (ANOVA) with Tukey's honestly significant difference (HSD) test to examine differences among multiple groups. The R package "phyloseq" was used for β diversity analysis to perform principal coordinate analysis (PCoA) based on the Bray-Curtis dissimilarity matrix. The analysis of similarities (ANOSIM) was conducted to evaluate the statistical significance, and a box plot was used to visualize the results.
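
For readers unfamiliar with these indices, the following Python sketch shows how they are commonly defined. It is not the code used in the study, which relied on R packages. The sketch assumes the Gini-Simpson form of the Simpson index (1 minus the sum of squared proportions) and the bias-corrected form of Chao1; other variants exist, and the function names are illustrative.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2) (assumed variant)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples with the same taxa order."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


sample = [10, 10, 1, 1, 2, 0]  # hypothetical read counts for six taxa
print(shannon(sample), simpson(sample), chao1(sample))
print(bray_curtis([10, 0], [0, 10]))
```

The pairwise Bray-Curtis values for all samples form the dissimilarity matrix on which PCoA and ANOSIM operate.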

| Differential abundance analysis
First, the Wilcoxon rank sum test was used to detect taxa that differed between groups. The cutoff value was a log value >2.0 and p < 0.01 in the Wilcoxon rank sum test. LEfSe was then used to determine the "microbiome-Marker." Finally, hypothesis testing was performed to assess the significance of the observed differences. The above steps were performed using the "LEfSe" package.

| Co-occurrence and clustering analysis
To further analyze the co-occurrence of microbiota, we used the R package "co-occur" to calculate the Spearman correlation between different genera within a group. The R package "psych" was used to screen significant and robust connections (p < 0.05, |ρ| ≥ 0.3). The network graph was then visualized using Gephi (v0.9).
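
The edge-selection step can be sketched in Python as follows. The sketch computes Spearman's ρ as the Pearson correlation of average ranks and keeps genus pairs with |ρ| ≥ 0.3. For brevity it omits the p-value filter (p < 0.05) that the study also applied, so it is an incomplete illustration rather than a reimplementation; the genus labels and abundances are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / (sx * sy)


def cooccurrence_edges(abundance, min_abs_rho=0.3):
    """Return (genus_1, genus_2, rho) for every pair with |rho| >= min_abs_rho."""
    edges = []
    for g1, g2 in combinations(sorted(abundance), 2):
        rho = spearman(abundance[g1], abundance[g2])
        if abs(rho) >= min_abs_rho:
            edges.append((g1, g2, rho))
    return edges


# Hypothetical relative abundances of four genera across five samples.
abundance = {
    "genus_A": [1, 2, 3, 4, 5],
    "genus_B": [2, 4, 6, 8, 10],
    "genus_C": [5, 4, 3, 2, 1],
    "genus_D": [2, 5, 3, 1, 4],
}
edges = cooccurrence_edges(abundance)
for g1, g2, rho in edges:
    print(g1, g2, round(rho, 2))
```

The resulting edge list (source, target, weight) is the kind of table that network tools such as Gephi accept as input.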

| Function prediction analysis
PICRUSt (v2.4.2) software was used to predict the functional gene profile based on the OTU table described above. Then, the related gene descriptions were annotated from the KEGG database to obtain a functional abundance profile. Functional differences were assessed using STAMP software (Welch's t-test; p < 0.05).

| Random forest model construction and evaluation
We used ML algorithms to construct a classification model and distinguish samples from various groups. First, we split the genus-level relative abundance dataset into a training set (70%) and a test set (30%), which were used for training and performance validation, respectively. We used three algorithms for model training, K-nearest neighbors (KNN), support vector machine (SVM), and RF, to find the most suitable ML method for this study. According to the AUC values obtained with each algorithm and the microbiome-based classifiers used in previous studies [27,28], we finally chose the RF R package to construct the predictive model. The hyperparameters ntree = 800 and mtry = √p (where p is the number of variables) were set during the analysis, and the other settings were left at their defaults. We identified core biomarkers using "Mean Decrease Accuracy" as the screening index according to importance-based rankings. Then, we performed 10-fold cross-validation of the RF model while obtaining model error values. The receiver operating characteristic (ROC) curve was plotted with the "pROC" package, and the AUC was calculated to evaluate the diagnostic ability of the model. The ML algorithm selection process and hyperparameter tuning are described in Supplementary Section 2.
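
The data-splitting and evaluation steps above can be sketched in plain Python. The sketch does not train a random forest; it only shows a seeded 70/30 split, the mtry = √p rule, and an AUC computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control, which equals the area under the empirical ROC curve. Stratifying the split by class is our assumption, since the text does not state it, and all function names and example scores are hypothetical.

```python
import math
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train/test sets, stratified by class (assumed)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def default_mtry(n_features):
    """Number of features tried at each split: floor(sqrt(p)), at least 1."""
    return max(1, math.isqrt(n_features))


def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 10 LC samples (1) and 10 controls (0).
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)

# Hypothetical predicted probabilities for six held-out samples.
auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
print(len(train_idx), len(test_idx), default_mtry(100), round(auc, 3))
```

In a full pipeline, the scores passed to `roc_auc` would be the class probabilities predicted by the trained classifier for the test samples.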
To verify the applicability of the model in different contexts, we performed study-to-study transfer validation and leave-one-dataset-out (LODO) validation. During study-to-study transfer validation, one study served as the training set and another as the test set; during LODO validation, one study was held out as the test set and the remaining studies were pooled as the training set. Then, AUC cross-tabulations were performed based on the screened core biomarkers [29]. In addition to internal validation, independent studies and individuals with other lung diseases need to be involved during external validation to demonstrate the reproducibility and specificity of the model. For 16S sequence data, external data could be applied directly as the test dataset, and predictions were summarized using ROC analysis. For whole genome sequencing (WGS) data, species consistent with the results of the 16S model were screened, and their abundance levels were substituted into the 16S model to obtain the AUC value.
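
The two validation schemes differ only in how samples are assigned to the training and test sets, which the following Python sketch makes explicit. It generates index splits from a per-sample list of study identifiers; model fitting and AUC calculation are left out. The function names and study labels are hypothetical.

```python
def leave_one_dataset_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for LODO validation."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


def study_to_study(study_ids):
    """Yield (train_study, test_study, train_indices, test_indices) for each ordered pair."""
    studies = sorted(set(study_ids))
    for train_study in studies:
        for test_study in studies:
            if train_study == test_study:
                continue
            train = [i for i, s in enumerate(study_ids) if s == train_study]
            test = [i for i, s in enumerate(study_ids) if s == test_study]
            yield train_study, test_study, train, test


# Hypothetical cohort labels, one per sample.
study_ids = ["S1", "S1", "S2", "S3", "S3", "S3"]
lodo_folds = list(leave_one_dataset_out(study_ids))
transfer_pairs = list(study_to_study(study_ids))
print(len(lodo_folds), len(transfer_pairs))
```

With k studies, LODO yields k folds and study-to-study transfer yields k(k - 1) ordered pairs, which is why the latter is naturally reported as an AUC cross-table.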

| Overview of the study cohort characteristics in the meta-analysis
Based on the inclusion criteria, 195 articles retrieved from PubMed and Embase were critically reviewed. Figure S2 shows the selection process for this study. A total of seven studies met our inclusion criteria. These studies involved populations from China, South Korea, the US, and six European countries (UK, Germany, Italy, Poland, Hungary, and Portugal). We collected 712 gut samples (fecal) and 393 lung samples (343 BALF and 50 bronchial brush samples), including samples for the validation cohort. Detailed information regarding all the cohorts used in this meta-analysis is listed in Table 1. The sample processing procedures and analytical methodology of each study are summarized in Table S3 and Table S4, respectively.

| Confounder analysis of the microbiomes associated with LC
Due to the technical and biological differences among the included studies, we first quantified the influence of study-associated confounders on microbiome composition. The results showed that the variance attributable to the factor "study" was greater than that attributable to disease status and other confounding factors, so that "study" had the greatest impact on microbial composition (Figure 1, Figure S3). Therefore, we used "study" as a blocking factor and used a two-sided blocked Wilcoxon rank sum test to adjust for batch effects. The differential OTUs least affected by "study" were selected for subsequent analysis.
In Figure 2A, the α diversity indices of the GM for the LC group were always slightly lower than those for the normal group, which was in accordance with previous findings [19]. This indicates that the decrease in species diversity is one of the manifestations of gut dysbiosis in LC patients. We then evaluated the differences and similarities between the two groups by conducting a β diversity analysis (Figure 2B). The results of PCoA showed a notable level of variability between the two groups when PCoA 1 (11.29%) and PCoA 2 (5.36%) were used as the abscissa and ordinate, respectively (p = 0.001). ANOSIM also showed significant differences between the LC and normal groups (R > 0, p = 0.001) (Figure S4A). In conclusion, our meta-analysis revealed significant differences in the species diversity and composition of the GM between LC patients and controls. We annotated the species of all OTUs and identified 11 phyla, 17 classes, 21 orders, 42 families, and 89 genera. At the phylum level, Firmicutes were the predominant GM members in the LC group, followed by Bacteroidetes, Proteobacteria, and Actinobacteria. The dominant phyla in the normal group were similar to those in previous studies and did not differ significantly from those in the LC group (Figure 2C). However, the differences were more evident at the
genus level. In the LC group, the abundance levels of Bacteroides and Escherichia increased significantly, while those of Prevotella and Coprococcus decreased significantly (Figure 2D). We performed multi-LEfSe to identify statistically significant biomarkers. The results showed that species-related differences could mainly be attributed to Prevotella, Faecalibacterium, and Enterococcus (Figure 2E). Some bacterial taxonomic clades differed significantly between the LC and control groups (log10 [LDA score] > 2) (Table S5). Eight genera, including Enterococcus, Lactobacillus, Escherichia, and Streptococcus, were markedly enriched in the LC group. These might represent important markers for the early screening of LC. Then, we analyzed the evolutionary [...]
In the meta-analysis, samples were obtained from the involved airway (the lung nodule segment) or the uninvolved airway (usually in the lobe contralateral to the suspicious nodule). To ensure the accuracy of the results, we performed PCoA for these samples (Figure S5). The results showed no significant differences in microbial composition between the airways involved and uninvolved with LC, and the same was observed for benign samples. Therefore, we combined all lower respiratory samples for the subsequent analyses.
We also performed a diversity analysis, and Figure 3A shows that the α diversity gradually increased with disease development. The Simpson index showed the most notable change. PCoA showed more pronounced intergroup differences in the LM than in the GM (Figure 3B). ANOSIM also confirmed these results (Figure S4B). The species mainly belonged to dominant phyla such as Proteobacteria, Firmicutes, and Bacteroidetes, similar to those in the gut but present at slightly different proportions (Figure 3C). At the genus level, the dominant genera differed significantly between the normal control group and the disease groups (LC and benign). Prevotella and Rhodanobacter levels were reduced considerably in the disease population (Figure 3D).
We also performed LEfSe to reveal potential differential microbial taxa among the three groups (Figure 3E, F). At the genus level, Escherichia, Staphylococcus, and Bacillus were representative markers of the LC group, and Prevotella and Pseudomonas characterized the normal control and benign groups, respectively (Table S6).

| Association and differences between GM and LM in lung cancer
As shown in Figure 4A, microbial diversity was significantly higher in the gut than in the lung in both the normal control and LC groups. Interestingly, we found an overlap between the biomarkers of the LC group at the two sites and identified three shared biomarkers: Enterococcus, Lactobacillus, and Escherichia (Figure 4B). All three genera may play essential roles in tumorigenesis at the two sites.

| Co-occurrence network analysis of microbiota
To understand the correlations among microorganisms in each group and the strength of their interactions, we constructed co-occurrence networks based on the previous results. The number of nodes and the association density of the GM networks (Figure 5A, B) were significantly lower than those of the LM networks (Figure 5C-E), indicating that interactions within the LM were closer than those within the GM and involved a wider range of bacteria.
With regard to LM, the LC group (Figure 5E) had more edges and a higher network density than the normal control (Figure 5C) and benign groups (Figure 5D). This suggests that disease occurrence enhances the original interactions between flora. The same phenomenon can be observed in GM, where the edges were thicker in the LC group (Figure 5B). In addition, the size of the nodes revealed that some essential bacterial genera potentially acted as key hubs in the community. For example, Oscillospira was more closely associated with other genera in the gut LC group. In the lung LC group, Veillonella, Actinomyces, and Oribacterium had higher centrality levels.
Overall, interaction-based relationships were observed to a greater extent in Proteobacteria and Firmicutes. Firmicutes contain many butyrate-producing bacteria, and many studies have revealed that butyrate has antitumor properties [30,31]. This indicates that some strains of Firmicutes are likely to be involved in antagonism and competition with pathogenic bacteria, leading to more complex networks in the community, which also explains the thicker edges and stronger clustering of Firmicutes species in the LC group.
The AUC of the GM-based classifier model was 0.81 for the LC group versus the normal control group, based on 36 feature variables (Figure 6A). The LM-based classifier model had a higher ability to distinguish between individuals with and without cancer, based on 26 feature variables (AUC = 0.90, Figure 6B). We also found that LM had good diagnostic capabilities and could effectively distinguish the LC group from the benign group based on seven feature variables (AUC = 0.81, Figure 6C). The comprehensive performance indicators of the three RF models on the testing dataset are shown in Table 2. Our results show that microbial markers from both sites have excellent diagnostic value, and that LM performed better than GM for disease prediction.

| Validation of the robustness of the microbial classifier
We performed study-to-study transfer and LODO validation within the included projects to test whether these two classification models are universal and robust across multiple studies. For the GM classifier, the AUC values for study-to-study transfer validation ranged from 0.59 to 0.93, with a mean of 0.66 (Figure 6E). Notably, a relatively higher testing value (AUC = 0.93, mean 0.73) was observed with the Seoul group as the training set, which can be explained by the relatively large sample size of its dataset. The results of the LODO analysis showed that the AUC of the gut microbial classifier ranged from 0.62 to 0.74 (average AUC = 0.66). To confirm the results of 16S rRNA gene sequencing, we included two additional independent cohorts for external validation (Figure 6D). The RF model yielded AUC values of 0.86 and 0.78 in the two independent cohorts, respectively. In addition, we examined the clinical applicability of the GM model. The analysis showed that BMI significantly affected the model results, and the model predictions were best for individuals with high BMIs (Figure 6F).

| Assessing the specificity of predictive models
To reduce the occurrence of false positive results in clinical diagnosis, it is necessary to further confirm the specificity of the predictive model. In this analysis, we considered six non-LC diseases: tuberculosis (TB), asthma, idiopathic pulmonary fibrosis (IPF), chronic hypersensitivity pneumonitis (CHP), interstitial lung disease (ILD), and chronic obstructive pulmonary disease (COPD). For the GM models, as seen from the box plot of AUC values, the AUC values for the non-LC disease models were significantly lower than those for the LC model (Figure 6G). The LC versus normal model based on LM also showed good specificity (Figure 6H), while the LC versus benign model performed poorly (Figure 6I). These results emphasize that the markers used to distinguish between LC patients and normal controls are specific and exclusive, without interference from associated lung diseases. However, LM-based analysis is insufficient for the more precise task of differentiating among lung disease types. It appears that individuals with lung diseases have remarkably similar states of LM dysbiosis and may share some common pathogenic bacteria.

| Altered microbial functions in lung cancer
Overall, the functional differences between LC patients and controls were less pronounced in the GM (Figure 7A) than in the LM (Figure 7B). This may be attributable to the proximity of the LM to the lesion and the considerable effect of the pathological lung environment.
We identified 50 and 75 marker pathways in the gut and lung, respectively. Among the GM marker pathways, 21 pathways, including those for secondary bile acid biosynthesis, tetracycline biosynthesis, and lipoic acid metabolism, were upregulated in the LC group (Table S7). In contrast, functions associated with bacterial chemotaxis and flagellar assembly were decreased. Among the 75 marker pathways in the LM, pathways related to the phosphotransferase system and D-arginine and D-ornithine metabolism were enriched in the LC group (Table S8). The pathways related to lipopolysaccharide biosynthesis were downregulated in the LC group. Finally, we tried to identify the functional differences and relationships between the two sites and found seven common upregulated pathways (Figure S6A). These pathways may play a key role in LC carcinogenesis. Although these results are predictions obtained from PICRUSt 2, they suggest that different degrees of metabolic reprogramming might occur in the microbiota at different sites during LC progression. Functional prediction based on WGS data also confirmed some of the functions predicted from the 16S data (Figure S6B).

| DISCUSSION
This study comprehensively assessed the capability of LM and GM for early LC detection. Our results show that GM and LM exhibit good predictive ability for LC screening. In addition, we found that LM also performed well in distinguishing between benign lung disorders and LC. Still, a subsequent specificity-related validation showed that this model was susceptible to interference from other lung diseases. Finally, we constructed two microbial classification models to distinguish the LC and normal groups (AUC gut = 0.81; AUC lung = 0.90). Considering its non-invasiveness, convenience of use, and cost-effectiveness, we believe that the GM model is more suitable for developing a new method for early screening.
We also conducted a multi-level validation process for the gut model, and the results supported the robustness of the classifier. The results provide evidence for the feasibility of using GM for the non-invasive diagnosis of LC.

FIGURE 7 Altered functions in the gut (A) and lung (B) microbial communities. KEGG pathways with significant differences between the lung cancer group and the normal group. KEGG, Kyoto Encyclopedia of Genes and Genomes.

The predictive performance of GM as an independent diagnostic tool has been demonstrated in more than 20 diseases [32]. For LC, the diagnostic ability of the GM model constructed by Wang et al. (AUC = 0.85) was slightly higher than that of our model (AUC = 0.81) [19].
Through methodological comparison, we found that their study used logistic regression analysis. The linear simulation associated with regression analysis is simpler and faster to run, but confidence interval coverage may deteriorate at higher levels of data complexity [33]. With regard to the pooled data, the RF algorithm can effectively evaluate the importance of individual features during the classification process and helps obtain reliable results in the presence of missing values [34]. Importantly, research by Qi et al. showed that RF had superior performance compared to other multiclass classifiers (KNN, SVM, graph convolutional neural network, and multi-layer perceptron) when GM data were used for training [27]. [...] [37,38] Therefore, we believe that the RF model can solve our problems more effectively. In addition, Lu et al. and Lim et al. also adopted the RF algorithm, and the prediction accuracy was >0.7 [20,23]. This again demonstrates the feasibility of the algorithm. However, very few of these classification rules have been tested in independent studies. In other studies that examined the same problem or data, the included population was limited to a single LC subtype or a local area, which may not capture features common to multiple populations with LC [19,20]. We included all subtypes of LC and several regions in our study, because of which our findings have broader applicability and better reproducibility.
For gut samples, the dominant phyla and genera identified by our meta-analysis were in accordance with those identified in previous studies; only the proportions differed slightly [19]. This may be attributable to unavoidable factors such as sample types, regional differences, and analytical tools. We performed LEfSe analysis to identify gut biomarker taxa that differed between the two groups. The results showed that Enterococcus was the most common marker. Notably, it was also screened as a biomarker during LM analysis. Zhuang et al.
reached the same conclusion in their study [39]. Current studies on Enterococcus have focused on exploring its association with disease prognosis. In a WGS study of GM, Enterococcus casseliflavus was shown to serve as a biomarker of response to chemotherapy in LC patients [40]. In addition, an article published in Science noted that the occurrence of Enterococcus prophage in the gut of LC patients is significantly associated with the long-term benefits of PD-1 blockade therapy [41]. However, it has also been shown that Enterococcus has a growth-promoting effect on A549 cells (a non-small cell lung cancer cell line) while altering their stiffness [42]. Thus, it is essential to be aware of the potential harm caused by Enterococcus as a cancer promoter while taking advantage of its benefits in adjuvant therapy, in order to maximize the benefits of the microorganism to the host.
In addition to GM, other types of microbiota have also shown powerful classification ability. For example, Lu et al. used sputum microbiota data for modeling (AUC = 0.75) [20]. However, microorganisms are not the best classifying factor for sputum samples, and microRNA-21 and TMEM196, isolated from sputum, could provide more accurate diagnostic results [43,44]. In another study, Veillonella and Capnocytophaga in saliva yielded a receiver operating characteristic value of 0.86 [45]. However, because sputum and saliva are susceptible to contamination with oral microbiota at the time of collection, the accuracy of these findings is questionable [46].
Theoretically, the LM is an ideal sample for studying LC, as it is spatially closest to the lesion. It reflects the close relationship between the microbiome and the disease most accurately. Previous studies have demonstrated that BALF samples could reflect the LC tissue microbiota more effectively than sputum samples [47]. The model developed by Bello et al. also demonstrated the high diagnostic value of microorganisms identified during bronchial biopsy (AUC = 0.89) [48]. Although LM sampling can help obtain a definite histological diagnosis of LC, it may cause bleeding, resulting in an invasive injury and extreme discomfort during sampling [49]. Hence, this sample type is unsuitable for screening the general population and is more appropriate for individuals known to have lung disease.
In our lung sample study, the lung microbiome was more similar between patients with lung disease and patients with lung cancer, and Jin et al. reached a conclusion consistent with ours.16 Furthermore, the diagnostic model they developed had an AUC value of 0.88. It is important to note that they used WGS data, which identify species more accurately than the 16 s sequencing we used.50,51 However, our study has the advantages of multiple cohorts and low economic cost. Regarding alpha diversity, the previous meta-analysis indicated that the α diversity is lower in healthy individuals, which is consistent with our results.52,53 However, Jin et al. reported that the α diversity was lower in the LC population and decreased steadily at more advanced stages.16 Interestingly, similar trends were seen in the study by Greathouse et al., which used lung tissue instead of BALF.54 Thus, α diversity may not be an appropriate indicator of lung health. Pulmonary disease alters the composition of LM, as evidenced by an increase in pathogenic bacteria such as Escherichia coli (E. coli), Streptococcus pneumoniae, and Haemophilus influenzae.55 This phenomenon was also observed in our study. E. coli was selected as our biomarker.7][58] However, the mechanism by which it plays a role in lung carcinogenesis is still unclear, and only its potential relevance has been demonstrated.59 Some studies have confirmed that E. coli can promote the induction of IFN-α, IFN-β, and ISG in lung cells, eventually leading to lung injury.60,613][64] E. coli, as a typical parasitic bacterium of the gut, is frequently found in the lungs of LC patients. Its transmission routes, including the fecal-oral and bloodstream routes, need to be explored in future studies.
In terms of LC diagnosis, in addition to single-microbiota models, combinations of multiple biomarkers also perform well. According to the results described by Lu et al., a model using a combination of micro biomarkers (AUC gut + sputum = 0.82) showed improved performance for patient stratification compared to models using an individual dataset (AUC gut = 0.76; AUC sputum = 0.75).20 Notably, in studies of other cancer types, the use of a combination of different biomarkers (carcinoembryonic antigen, carbohydrate antigen, blood microorganisms, etc.) for the construction of classifiers has increased the predictive power compared to that observed with the original single microbial data.38,65,66 Some studies also included clinical indicators such as age and BMI to enhance the accuracy of the diagnosis.67 Unfortunately, no study in the LC field has yet built a combined model from microbiota and blood biomarkers. Only a few correlation studies have been conducted. For instance, Chen et al. and Wang et al. used the Spearman rank correlation test to examine the associations between different microorganisms and metabolites, to explore the mechanisms by which bacterial flora may intervene in metabolism, and to develop new blood biomarkers.18,19 In future studies, we shall attempt to meta-analyze more types of biomarkers to improve our diagnostic performance.
Besides diagnosis, ML also has promising applications in the prognosis, classification, grading, and treatment optimization of cancer patients.68 For example, Acidovorax and Veillonella in the sputum can help diagnose squamous cell carcinoma with 80% sensitivity, while Capnocytophaga can be used to identify lung adenocarcinoma with 72% sensitivity.22 Regarding prognosis, both LM and GM differ significantly between groups exhibiting long and short progression-free survival durations.69,70 Based on the oral microbiota, the predictive potential could reach 0.89.23 In terms of the therapeutic effects of chemotherapy or immunotherapy, numerous studies have been carried out to screen response micro biomarkers and predict clinical side effects.40,71,72 Interestingly, these microbiome-based classifiers all use RF algorithms, which suggests that the RF classifier has become the preferred choice for this kind of problem.73 Notably, extensive literature reviews have shown that less attention has been focused on targeted therapy and radiotherapy, representing a significant gap in the research on LC-related microbiota. Furthermore, ML techniques have been used to improve drug research, drug discovery, pharmacokinetic prediction, and drug treatment prediction.74,75 For example, AI-based drug-drug interaction (DDI) prediction models have been constructed to facilitate the fundamental application of drug therapy and support clinical decisions.76 Hung TNK et al. extracted datasets from the DrugBank database as raw information. Then, ML algorithms (RF and XGBoost) were used to construct models for predicting DDIs related to osteoporosis and Paget's disease, which exhibited an average accuracy of nearly 74%.77 However, the use of AI for LC drug selection has not yet been developed effectively. In the future, it will still be necessary to study optimal decision algorithms for selecting the best compounds and providing personalized drug therapy to LC patients.78

In addition to traditional ML algorithms, deep learning can build progressively deeper networks that learn and approximate real models.79 This particular ML technology could help us achieve powerful learning and diagnostic capabilities in the future. However, the generalizability of the data, the interpretability of the algorithm (the "black-box problem"), and the problems of data access and medical ethics in real-world applications are all challenges that remain to be addressed.80

Despite our findings, this study has some limitations. First, we cannot obtain information on fungi and viruses using 16 s rRNA sequencing technology. Hence, the scope of the research is limited to bacteria. Moreover, the functional prediction and taxonomic resolution afforded by 16 s rRNA data are inferior to those of WGS data. However, the cost of WGS for extensive screening needs to be considered. Second, our sample size and sampling area need to be expanded, and a larger sample from multiple centers is required for modeling and validation. Finally, the functional analysis results should also be validated by transcriptomics and metabolomics studies. Despite these limitations, the present results are of great significance for identifying LC biomarkers and should contribute to the study of pathogenesis in the future. In particular, the discovery of gut biomarkers is beneficial for the non-invasive diagnosis of LC. Diagnostic accuracy can be improved in future studies by refining subtypes, determining LC stages, and considering more clinical indexes. Moreover, the combination of ML with real-world data streams such as genomics, pathology, and electronic health records would facilitate a powerful electronic synthesis that will be necessary for the further development of modern medicine.
In summary, we analyzed the composition of the gut and lung microbiome of LC patients via 16 s rRNA gene sequencing and constructed diagnostic models. The results show that LM has a higher diagnostic value than GM. However, GM is a promising candidate for developing non-invasive diagnostics for a wide range of early screening applications. Moreover, we screened micro biomarkers based on multi-population samples that may be useful as entry points for targets that suppress LC carcinogenesis in the future. Our findings support the view that the LC population has a characteristic microbial composition and could provide some insight into the development of microbiology-based early screening kits.
effectiveness of LC-specific microbiological markers in multiple populations and contributes to the early diagnosis and screening of LC.

KEYWORDS: 16 s rRNA, gut microbiota, lung cancer, lung microbiota, machine learning

TABLE 1 Characteristics of the included datasets in the systematic review.
* The taxonomic profiles for a total of 16 stool samples from the Human Microbiome Project (HMP), as provided by MetaPhlAn2 (http://segatalab.cibio.unitn.it/tools/metaphlan2/), were used as a healthy control in the taxa comparison.
Abbreviations: BALF, bronchoalveolar lavage fluid; EUR (Leicester, Manchester, and Coventry [UK]; Munich, Marburg, and Freiburg [Germany]; Ferrara [Italy]; Warsaw [Poland]; and Budapest [Hungary]); NA, not available.

3.3 | Alteration of the microbiota of the lung cancer population

3.3.1 | Characteristic changes in the gut microbiota

FIGURE 1 Variance explained by disease status (LC vs. Normal) is plotted against variance explained by study effects for gut (A) and lung (B) samples for individual microbial species. The significantly differential OTUs are colored in blue, and P values were from the two-way ANOVA test. The abundance of each genus is plotted proportionally to the dot size. LC, lung cancer; OTUs, operational taxonomic units; ANOVA, Analysis of Variance.

relationships of the GM species. The evolutionary map of species branching during LEfSe analysis is shown in Figure 2F. The data showed that the dominant flora of the two groups was significantly different, in agreement with the histogram showing the distribution of LDA values.

FIGURE 2 Alterations of gut microbiota composition in lung cancer. (A) Differences in α diversity between the Lung cancer and Normal groups based on the standardized OTU table and the Chao 1, Shannon, and Simpson indices. (B) The β diversity was evaluated by PCoA based on Bray Curtis distance, which shows that the GM composition differed among groups. (C) Phylum-level taxonomic profiles of LC patients and normal individuals. (D) Genus-level taxonomic profiles for the two groups of samples. (E) Histogram of the distribution of LDA values for LEfSe analysis of the two groups. (F) Evolutionary map of species branching for LEfSe analysis of GM in the two groups. OTUs, operational taxonomic units; PCoA, principal coordinate analysis; GM, gut microbiota; LC, lung cancer; LDA, Linear discriminant analysis; LEfSe, Linear discriminant analysis Effect Size.
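As an illustration of the α diversity indices named in the Figure 2 caption, the following minimal Python sketch computes Chao 1, Shannon, and Simpson values from a vector of OTU counts for one sample. The function names are hypothetical, and two choices are assumptions rather than the settings used in this study: Chao 1 is given in its bias-corrected form, and Simpson is reported as the Gini-Simpson index (1 minus the sum of squared proportions).

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p_i^2); an assumed convention."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao 1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical OTU counts for one sample (illustrative values only).
sample = [5, 1, 1, 2]
print(chao1(sample), shannon(sample), simpson(sample))
```

A perfectly even community of four OTUs gives a Shannon index of ln 4 and a Gini-Simpson index of 0.75, which is a convenient sanity check for any implementation.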

3.3.2 | Alterations in the composition of lung microbiota in lung cancer patients

FIGURE 3 Microbial composition and difference analysis of lung microbiota in the Lung cancer group, Benign group, and Normal group. (A) Comparisons of α diversity between different groups. α diversity, measured with the Chao 1, Shannon, and Simpson indices, was computed with all OTUs in all samples. (B) β diversity as shown by PCoA of Bray Curtis distances. (C) Taxonomic composition at phylum level in lung samples. (D) Taxonomic composition at genus level in lung samples. (E) Differential taxa identified by LEfSe with LDA values of 2. Taxa enriched in different groups are displayed by the color indicated in the key (red indicating taxa abundant in the LC group, blue in the Normal group, and green in the Benign group). (F) Cladogram showing the phylogenetic distribution of microbiota associated with the three groups. OTUs, operational taxonomic units; PCoA, principal coordinate analysis; LEfSe, Linear discriminant analysis Effect Size; LDA, Linear discriminant analysis; LC, lung cancer.
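The Bray Curtis distance underlying the PCoA in panel (B) can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in this study; the function names and the example abundance vectors are hypothetical, and the ordination step itself (PCoA) is omitted.

```python
def bray_curtis(u, v):
    """Bray Curtis dissimilarity between two abundance vectors:
    sum(|u_i - v_i|) / sum(u_i + v_i). Returns 0.0 for two empty samples."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    """Symmetric pairwise Bray Curtis matrix for a list of samples."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical genus-level counts for three samples (illustrative only).
samples = [[6, 4, 0], [4, 6, 0], [0, 0, 10]]
for row in distance_matrix(samples):
    print(row)
```

The value ranges from 0 (identical composition) to 1 (no shared taxa), so the third sample above is at distance 1 from the other two.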

3.5 | Diagnostic models for lung cancer based on different micro-ecological loci

3.5.1 | Construction of microbial classification models

FIGURE 4 Differences and relationships between gut microbiota and lung microbiota. (A) Gut and lung microbiota differed significantly in terms of α diversity. Comparisons of the Richness index (Normal group, left; Lung cancer group, right). (B) Overlap of gut and lung biomarkers of lung cancer in a Venn diagram.

FIGURE 5 Co-occurrence network of the microbiota in the different groups. The correlation coefficient was calculated with the Spearman rank correlation test (|ρ| ≥ 0.3). (A) Correlation networks in the Normal group of gut microbiota. (B) Correlation networks in the Lung cancer group of gut microbiota. (C) Correlation networks in the Normal group of lung microbiota. (D) Correlation networks in the Benign group of lung microbiota. (E) Correlation networks in the Lung cancer group of lung microbiota. Each circle represents the average relative abundance of a microbial species in that state. Node sizes are scaled according to their degrees of connection. The thickness of the line represents the strength of the relationship.

FIGURE 6 Construction and validation of a diagnostic model based on gut and lung-specific microbiota. (A) The ROC of the LC versus Normal group classification model based on the GM. (B) The ROC of the LC versus Normal group classification model based on the LM. (C) The ROC of the LC versus Benign group classification model based on the LM. (D) Validation of the GM classification model in two independent cohorts. (E) Cross-prediction matrices were constructed using the study-to-study transfer validation and LODO validation values for gut classifiers. The value on the diagonal is the cross-validation result of a single study, and the off-diagonal is the cross-validation result between cohorts. (F) Validation of the clinical applicability of the GM classification model. A and B are different groups defined by the same clinical parameter: Sex (A/B, Male/Female); Age (A/B, age > 60/age ≤ 60); BMI (A/B, BMI ≤ 24/BMI > 24); Cancer type (A/B, LUAD/SCC); Glucose (A/B, 70 < Glu < 120/Glu ≥ 120). (G) The specificity of gut predictive models. Lung-related diseases were used to validate the specificity of the GM classification model: TB (n = 58) versus Normal (n = 22) model, asthma (n = 45) versus Normal (n = 14) model. (H) Specific validation of the LC versus Normal group model based on the LM: IPF (n = 45) versus Normal (n = 28) model, CHP (n = 110) versus Normal (n = 28) model. (I) Specific validation of the LC versus Benign group model: ILD (n = 18) versus Benign (n = 15) model, COPD (n = 34) versus Benign (n = 15) model. ROC, receiver operating characteristic curve; LC, lung cancer; GM, gut microbiota; LM, lung microbiota; LODO, leave-one-dataset-out; BMI, body mass index; LUAD, Lung Adenocarcinoma; SCC, squamous carcinoma; Glu, glucose; TB, tuberculosis; IPF, idiopathic pulmonary fibrosis; CHP, Chronic hypersensitivity pneumonitis; ILD, interstitial lung disease; COPD, chronic obstructive pulmonary disease.
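To make the validation scheme in the Figure 6 caption concrete, the sketch below shows, under stated assumptions, how an area under the ROC curve can be computed from classifier scores and how leave-one-dataset-out (LODO) splits can be generated. It uses only the Python standard library; the function names, study labels, and scores are hypothetical, and the classifier itself (a random forest in this study) is deliberately left out.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def lodo_splits(study_ids):
    """Yield (held_out_study, train_indices, test_indices): each study is
    held out once for testing while all other studies form the training set."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical labels (1 = LC, 0 = Normal) and predicted scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# Hypothetical study labels, one per sample.
for held_out, train, test in lodo_splits(["A", "A", "B", "C"]):
    print(held_out, train, test)
```

In a LODO evaluation, a model would be fitted on the `train` indices of each split and `roc_auc` evaluated on the corresponding `test` indices, giving one AUC per held-out study.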